Evolution of Phage Tail Sheath Protein

Sheath proteins comprise a part of the contractile molecular machinery present in bacteriophages with myoviral morphology, contractile injection systems, and the type VI secretion system (T6SS) found in many Gram-negative bacteria. Previous research on sheath proteins has demonstrated that they share common structural features, even though they vary in their size and primary sequence. In this study, 112 contractile phage tail sheath proteins (TShP) representing different groups of bacteriophages and archaeal viruses with myoviral morphology have been modelled with the novel machine learning software, AlphaFold 2. The obtained structures have been analysed and conserved and variable protein parts and domains have been identified. The common core domain of all studied sheath proteins, including viral and T6SS proteins, comprised both N-terminal and C-terminal parts, whereas the other parts consisted of one or several moderately conserved domains, presumably added during phage evolution. The conserved core appears to be responsible for interaction with the tail tube protein and assembly of the phage tail. Additional domains may have evolved to maintain the stability of the virion or for adsorption to the host cell. Evolutionary relations between TShPs representing distinct viral groups have been proposed using a phylogenetic analysis based on overall structural similarity and other analyses.


Introduction
Tail sheath proteins (TShP) have a particular role in the structural biology of phages as a molecular engine for viral infection. The first object in the study of sheath proteins was gp18 from the classic phage T4. Unfortunately, it was not the best choice, because the recombinant protein tended to polymerise, forming polysheaths. Therefore, the structure was determined for a protease-resistant fragment (amino acids 83-365) and a deletional mutant of this protein (residues 1-510) [1]. Thus, the fine details of tail sheath contraction remained understudied. Meanwhile, it was shown that less complicated sheath proteins from prophages do not form polymeric structures, and their crystal structures could be revealed (3HXL and 3LML). X-ray analysis of the protease-resistant fragment of phage phiKZ TShP allowed scientists to compare the structures of sheath proteins from distant phages and revealed a common fold in this type of protein [2].
The development of cryo-electron microscopy enabled the reconstruction of whole particles of many bacteriophages at near-atomic resolution. It therefore became possible to work with the structures of a number of proteins from different phages and bacteriocins (T4, 812, A511, anti-feeding prophage, PVC). Some contractile systems have been shown to be organised from several sheath proteins combined to form complex structures with a specific alternation of layers [3,4]. The tail sheaths of some phages have a complicated morphology, with sheath proteins encrusted by proteinaceous fibres (AR9, PBS1 [5]).
Natural contractile injection systems (CISs) are numerous and diverse. They can be further subdivided into: (i) those mediating bacterial cell-cell interactions, such as type VI secretion systems (T6SSs), and (ii) extracellular CISs (eCISs). All these nanomachines possess their own sheath proteins, which are often different from phage-originated ones.

Escherichia phage T4; 1.8 Å; X-ray diffraction [1]
3FOA: Crystal structure of the bacteriophage T4 tail sheath protein, deletion mutant gp18M; Escherichia phage T4; 3.5 Å; X-ray diffraction [1]
3HXL: Crystal structure of the sheath tail protein (DSY3957) from Desulfitobacterium hafniense; Desulfitobacterium hafniense; 1.90 Å; X-ray diffraction [46]
3LML: Crystal structure of the sheath tail protein Lin1278 from Listeria innocua, Northeast Structural Genomics Consortium Target LkR115; Listeria innocua; 3.3 Å; X-ray diffraction [47]
3SPE: Crystal structure of the tail sheath protein protease-resistant fragment from bacteriophage phiKZ; Pseudomonas phage phiKZ; 2.4 Å; X-ray diffraction [2]
5LI4: Bacteriophage phi812K1-420 (Staphylococcus phage 812) tail sheath protein after contraction. This structure is related to 5LI2, 5LII, 5LIJ; Staphylococcus phage 812; 4.2 Å; Electron microscopy [48]
6GKW: Crystal structure of the R-type bacteriocin (diffocin) sheath protein CD1363 from Clostridium difficile 630 in the pre-assembled state; Clostridium difficile; 1.9 Å; X-ray diffraction [49]
6PYT: CryoEM structure of precontracted pyocin R2 trunk from Pseudomonas aeruginosa; Pseudomonas aeruginosa; 2.9 Å; Electron microscopy [50]
3J9O: CryoEM structure of a type VI secretion system from Francisella tularensis subsp. novicida U112; Francisella tularensis subsp. novicida; 3.70 Å; Electron microscopy [51]
5N8N: CryoEM structure of contracted sheath of a Pseudomonas aeruginosa type VI secretion system consisting of TssB1 and TssC; Pseudomonas aeruginosa; 3.28 Å; Electron microscopy [52]
3J9G: Atomic model of the VipA/VipB, the type VI secretion system contractile sheath of Vibrio cholerae; Vibrio cholerae; 3.5 Å; Electron microscopy [53]
6RAO: Cryo-EM structure of the anti-feeding prophage (AFP) baseplate for Serratia entomophila. This structure is related to 6RAP, 6RBK, 6RBN, 6RC8, 6RGL; Serratia entomophila; 3.1 Å; Electron microscopy [4]
7AE0: Cryo-EM structure of an extracellular contractile injection system from the marine bacterium Algoriphagus machipongonensis with the sheath-tube module in its extended state. This structure is related to 7ADZ, 7AE0, 7AEB, 7AEF, 7AEK; Algoriphagus machipongonensis; 2.4 Å; Electron microscopy [54]
7B5I: Cryo-EM structure of the contractile injection system cap complex from Anabaena PCC7120; Nostoc sp.; 2.8 Å; Electron microscopy [55]

Viruses 2022, 14 [3]

(a) 3FOA, crystal structure of the bacteriophage T4 tail sheath protein, deletion mutant gp18M; 3HXL, crystal structure of the sheath tail protein (DSY3957) from Desulfitobacterium hafniense; 3LML, crystal structure of the sheath tail protein Lin1278 from Listeria innocua; 5LI4, bacteriophage phi812K1-420 tail sheath protein after contraction; 6GKW, crystal structure of the R-type bacteriocin sheath protein CD1363 from Clostridium difficile in the pre-assembled state; 6PYT, cryoEM structure of precontracted pyocin R2 trunk from Pseudomonas aeruginosa. (b) 3J9G, sheath protein (VipB) from the type VI secretion system of Vibrio cholerae; 3J9O, sheath protein (IglB) from the type VI secretion system of Francisella tularensis subsp. novicida; 5N8N, sheath protein (TssC) from the type VI secretion system of Pseudomonas aeruginosa; 6RAO_E, 6RBN_C, 6RBN_D, three sheath proteins of the anti-feeding prophage (AFP) of Serratia entomophila. The models are coloured based on a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, and the C-terminus is coloured red.

The sheaths of Serratia entomophila AFP and Photorhabdus asymbiotica CIS comprise three proteins encoded by three adjacent genes located in the gene cluster of the contractile molecular machine. Alignment of the amino acid sequences of the sheath proteins encoded by the different genes indicates that they are related to one another and shows well-marked homology between the proteins belonging to these two species.

Positioning of the Conserved Core in Experimentally Determined TShPs
Superimposition of the structures depicted in Figure 1 indicated distinct structural similarities for sheath proteins belonging to phage tails, T6SS, and the extracellular contractile injection system. Structural alignment of the experimentally acquired structures clearly showed the presence of a conserved core shared by all aligned proteins (Figure 2a). Several determined structures lacked some residues, but structural alignment using the AlphaFold 2 models showed the conserved core to a fuller extent (Figure 2b). In particular, the conserved part of the protein from Escherichia phage T4 and Staphylococcus phage 812 is interrupted by long insertions (Figure 2c) and includes residues located in both the N-terminal and C-terminal parts. Remote contacts between the different regions illustrated in protein topology graphs demonstrate mixed connections between the N-terminal and C-terminal parts of the experimentally determined structure of the TShP deletion mutant 3FOA and mostly antiparallel connections within the central regions of the strands (Figure 3).
To clarify the position of the conserved common core in the phage tail, the previously published results [56] for the cryo-EM reconstruction of the extended (3J2M) and contracted (3J2N) tail of phage T4 were used (Figure 4a). The original reconstruction contains the fitted model of the tail sheath protein built on the basis of the experimentally determined structure and the results of structure modelling. The superimposed AlphaFold 2 model of the phage T4 tail sheath protein was also used (Figure 4b). This model is similar to the original model used.
A visual analysis of the models obtained (Figure 4b) shows that the conserved core is closer to the tail tube than the other parts of the protein. The N-terminus is located more distantly from the tail tube proteins than the C-terminus, but the domains outside of the common core are placed even farther from the tail tube. This indicates that interactions may exist between the tail tube and tail sheath proteins, and the common core part of the sheath protein may be important for correct phage tail assembly. The cryo-EM reconstruction of Staphylococcus phage 812 indicates a similar layout [48].

Choosing Representative Sequences for Modelling
At the beginning of 2022, the classification of bacteriophages approved by the International Committee on Taxonomy of Viruses (ICTV) [57] included four families of phages with myoviral morphology, namely, Myoviridae, Ackermannviridae, Chaseviridae, and Herelleviridae. In addition, a gene encoding a tail sheath protein with myoviral morphology was found in the genome of Paenibacillus phage Lily [58], comprising the singleton Lilyvirus genus, not assigned to any phage family [57]. This phage was first reported as a siphovirus, but its genome shows a high level of similarity with Paenibacillus phage ERIC V (81.6% average nucleotide identity, according to orthoANI calculations) [59]. The latter is classified as a member of the Myoviridae family. Ackermannviridae, Chaseviridae, and Herelleviridae groups were delineated from the Myoviridae family. The re-evaluation of bacteriophage taxonomy continues. Currently, the Myoviridae group seems to be the most diverse. This diversity can be explained by the fact that, at the present time, the formation of new taxa is based on genomic/proteomic features, whereas the contractile tail, a hallmark of myoviruses, is a morphological property. A recent 2021 ICTV proposal (not yet ratified) suggests abolishing the definition of Myoviridae as a virus family, leaving a taxonomical gap between class Caudoviricetes and subfamilies/separate genera for describing phages with myoviral morphology. Nevertheless, the contractile tail is an important structural and functional feature, which especially concerns the subject of this discussion. Therefore, the Myoviridae term will be retained for the purposes of the current paper. 
Analysis of the alignments and HMM-HMM motif comparisons has indicated that the TShPs of phages belonging to the Ackermannviridae, Chaseviridae, and Herelleviridae families are conspicuously similar to one another within those groups, whereas the TShPs of the other myoviruses are the most diverse. At the beginning of January 2022, the GenBank phage database contained 278 entries attributed to Ackermannviridae, 34 entries attributed to Chaseviridae, 509 entries attributed to Herelleviridae, and 5723 entries attributed to Myoviridae.
... coloured according to a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, the C-terminus is coloured red, and the model superimposed with the "common core" of the experimentally determined sheath is coloured magenta.
Special attention has been paid to archaeal viruses because of their great significance for evolutionary biology. Many archaeal viruses are morphologically indistinguishable from tailed bacteriophages [20,60,61], and a genomic analysis of archaeal myoviruses has indicated the presence of tail sheath proteins reminiscent of those of some bacterial myophages.
In January 2022, the GenBank phage database contained 43 complete genomes for archaeal myoviruses. The genomes of the viruses listed below encode distinguishable putative tail sheath proteins:

2. Halobacterium phages of Myohalovirus genus: two complete genomes (phiH and ChaoS9) contain two TShPs of about 430 aa in length that have 52% identity and identical HMM-HMM motif comparison results obtained with HHpred [28].

4. Halorubrum

... [56] in the extended (3J2M) and contracted (3J2N) states. (b) The same as Figure 4a but superimposed with the AlphaFold 2 model of the T4 tail sheath. The AlphaFold 2 model is coloured based on a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, and the C-terminus is coloured red. The conserved core is circled red. TT, tail tube proteins; TshP, tail sheath proteins.
Representatives from all of the groups listed above were used for modelling. A BLAST search using the GenBank Bacterial database, containing archaeal and bacterial chromosomes and plasmid sequences, revealed that putative TShPs were encoded in Natronorubrum bangense strain JCM10635, Methanolacinia petrolearia DSM 11571, and other Euryarchaeota. The primary sequences of archaeal TShPs are often distant from known bacterial myovirus TShPs. Apparent homologs of archaeal TShPs have, however, been found in Pseudomonas phages belonging to the genus Otagovirus (for example, phage PPSC2), plasmids of Clostridium baratii str. Sullivan, and other bacterial plasmids and chromosomes.
Interestingly, homologs of sheath proteins can also be found in the genomes of archaea belonging to the Asgard group (Lokiarchaeota and Thorarchaeota), as well as to Crenarchaeota, Bathyarchaeota, and Pacearchaeota. Functional assignments for these homologous proteins were predicted by a BLAST search and HMM-HMM motif comparison.
After a preliminary analysis, about 2000 TShP sequences, both predicted and experimentally found, were extracted from annotated or re-annotated viral and prokaryotic genomes and used for fast phylogenetic tree construction with FastTree [34]. Some phage genomes encoded two copies of the tail sheath protein. For several Jumbo phages, such copies have been shown to arise by gene duplication [62]. In those cases where the phage genome encoded more than one TShP, only one was used for further analysis. A total of 109 phage sequences representing different clades of the tree, phage hosts, and taxa were selected. These included various representatives of the archaeal and bacterial Myoviridae, Ackermannviridae, Chaseviridae, and Herelleviridae families, and of the genus Lilyvirus. Archaeal proteins were used for a BLAST search of the archaeal and bacterial GenBank database to find homologs in archaeal and bacterial genomes. Three putative TShP sequences were added from the list of Jumbo phages predicted by the metagenome analysis in [63]. Sequences for experimentally found sheath proteins were also used to search for homologs in the genomes of bacteria and archaea.
In addition, phages belonging to the recently established Schitoviridae family of N4-like phages [64] possess a receptor known as the "non-contractile tail sheath protein" [65]. Two of these proteins, from phages Escherichia phage AlfredRasser (subfamily Enquatrovirinae, genus Enquatrovirus) and Delftia phage RG-2014 (genus Dendoorenvirus), were taken for further analysis. The sequences shown in Figure 2c and experimentally determined earlier were also taken for modelling using translated genes extracted from the corresponding genomes.
The total number of selected sequences was 155. This included 114 phage tail sheath proteins (112 contractile and 2 non-contractile), 25 sheath proteins homologous to archaeal tail sheaths from archaeal and bacterial chromosomes and plasmids, 8 sheath proteins from the type VI secretory system, 6 proteins from the extracellular contractile injection system (anti-feeding prophage), and 2 sequences for sheath proteins from bacteriocins (pyocin and diffocin). The functional assignments of all selected proteins were confirmed with a BLAST search and HMM-HMM motif comparison.
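The composition of the selected set can be cross-checked with simple arithmetic. The short sketch below only restates the counts given in the text; the dictionary keys and the function name are illustrative labels, not identifiers used in the study.

```python
# Category sizes as stated in the text; the keys are illustrative labels only.
selected = {
    "contractile phage TShPs": 112,
    "non-contractile Schitoviridae proteins": 2,
    "homologs of archaeal tail sheaths": 25,
    "T6SS sheath proteins": 8,
    "eCIS (anti-feeding prophage) sheath proteins": 6,
    "bacteriocin sheath proteins": 2,
}


def total_selected(counts):
    """Return the total number of selected sequences."""
    return sum(counts.values())


# Phage tail sheath proteins = contractile + non-contractile.
phage_tshps = (
    selected["contractile phage TShPs"]
    + selected["non-contractile Schitoviridae proteins"]
)

print(phage_tshps)               # 114
print(total_selected(selected))  # 155
```

The 112 contractile phage proteins correspond to the 109 sequences selected from the tree plus the three putative TShPs taken from the metagenome-derived Jumbo phages.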

Modelling and General Structural Analysis of Representative Sheath Proteins
Visual analysis of the modelled contractile sheath proteins revealed the different structural architecture of the models. All shared the conserved domain, composed of the N-terminal and C-terminal parts, and some had additional domains (from the point of view of a domain as a compact structure [66]). In a few cases, the modelled structures did not have clearly distinguishable domains, but in most cases, it was possible to estimate the number of domains. As expected, the so-called "non-contractile" receptor-binding "tail sheath protein" of Schitoviridae phages had a completely different fold and was not analysed further. Examples of the structural architecture for the modelled contractile phage sheath proteins are shown in Figure 5. The PDB files of all best-ranked modelled structures and FASTA sequences are included in the Supplementary Data (Supplementary Files S1 and S2).

One-Domain Contractile Sheath Proteins (Type 1)
The smallest modelled bacteriophage sheath protein belongs to Burkholderia phage BEK, a representative of the Tigrvirus genus of the Peduovirinae subfamily of the Myoviridae family. It consists of 341 aa and its spatial structure is very close to the structural common core (Figure 6a). Most of the protein has a structural similarity with the conserved core of experimentally determined structures found by alignment and shown in Figure 2c. The structural architecture of sheath proteins from bacteriocins, T6SS, and anti-feeding prophage can be described as one-domain structures.
This one-domain structure is shared by modelled Peduovirinae TShPs, representing seven genera of this subfamily. This type of structure is shared by a number of other bacteriophages and sheath proteins predicted in the genomes of bacteria and archaea (Lokiarchaeota, Bathyarchaeota, Euryarchaeota) (Figure 6, Table 2). The length of the modelled type 1 TShPs varies in the range 321-410 aa. The Candidatus Bathyarchaeota archaeon protein is structurally very similar to the Burkholderia phage BEK TShP (RMSD 1.3 Å), but the latter possesses an additional short N-terminal part of about 30 aa. The Candidatus Bathyarchaeota archaeon predicted sheath protein is the shortest modelled sequence, with a length of 321 amino acid residues.

Two-Domain Contractile Sheath Proteins (Type 2)
The remaining modelled contractile sheath proteins possess the part that is structurally similar to that of the one-domain contractile sheath proteins, which will be referred to as the "main domain", but they also possess additional domains. For some modelled structures, it was not possible to determine clearly whether the part of the protein excluding the main domain can be counted as a single domain. This may be due to the complex composition of the remaining part or to inaccuracy of the modelling. The structural architecture of most of the remaining sheath proteins can, however, be described as consisting of two domains, one of which is the main conserved domain. As a rule, the additional domain included β-sheet-related motifs and often contained immunoglobulin-like (Ig-like) β-sandwiches and short α-helical parts.
All isolated archaeal viruses contained type 2 sheath proteins. Currently, isolated archaeal myoviruses are described as infecting Halobacteria. In addition, type 2 sheath proteins were found in phages assigned to the Chaseviridae family, different Myoviridae genera, Paenibacillus phage Lily, chromosomes and genome assemblies of Gram-positive and Gram-negative bacteria, and archaea attributed to phyla Crenarchaeota, Euryarchaeota, Thaumarchaeota, and Thorarchaeota (Figure 7, Supplementary File S1). The type 2 sheath proteins vary from 426 aa to 713 aa in size. The largest type 2 sheath proteins basically belong to Jumbo phages infecting gammaproteobacteria. The type 2 TShPs from isolated archaeal viruses were smaller than most other type 2 TShPs and contained an additional domain that was basically composed of β-sheets forming a β-sandwich.

Multiple Domain Contractile Sheath Proteins (Type 3)
More than a third of the modelled structures showed a more complicated architecture than type 1 and type 2 proteins. The structural architecture of these proteins appears to be a further evolutionary development of type 2, and this architecture will be referred to as "type 3". The modelled type 3 sheath proteins form a multi-domain structure composed of three or more domains (Figure 8). As in the case of type 2 sheath proteins, the additional domains often possessed an Ig-like β-sandwich structure sometimes accompanied by a few α-helices, but the sheath proteins from two related (ANI 99.0%) Bacillus phages, AR9 and PBS1, included additional domains comprised mainly of α-helices (Figure 9). As in the case of type 2 sheath proteins, as well as experimentally determined structures (Figure 4), the additional domains were located away from the part of the sheath protein that can contact the tail tube. Most phage genomes above 100 kbp in size encoded sheath proteins with three or more domains. The highest numbers of domains, five or more, were found in Ackermannviridae phages (genome size of approximately 140-170 kbp) and Jumbo phages (genome size of 200 kbp and bigger). Variants in the structural architecture of these proteins included additional domains formed by one region of the polypeptide chain, or by two regions, one of which was closer to the N-terminus, while the other belonged to the returning part of the polypeptide chain located closer to the C-terminus.
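The three architecture types described in these subsections amount to a simple rule on the number of structural domains. As an illustrative sketch only (the function name is hypothetical, and the domain counts in this study were estimated by visual analysis of the models rather than computed), the rule can be written as:

```python
def sheath_architecture_type(n_domains: int) -> int:
    """Map a domain count to the architecture type used in the text.

    1 domain (main domain only)   -> type 1
    2 domains                     -> type 2
    3 or more domains             -> type 3
    """
    if n_domains < 1:
        raise ValueError("a sheath protein has at least the main domain")
    return min(n_domains, 3)


# Example: a one-domain Peduovirinae-like protein and a five-domain protein.
print(sheath_architecture_type(1))  # 1
print(sheath_architecture_type(5))  # 3
```

Structures without clearly distinguishable domains, which the text notes for a few models, fall outside this rule and would need to be handled separately.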

Phylogenetic Analysis of Modelled Sheath Proteins
Multiple structural alignment with mTM-align algorithms [39,40] records pairwise TM-scores, a length-independent scoring function for measuring the similarity of two structures [67]. A matrix containing pairwise TM-scores was used for BioNJ clustering and for making inferences about a phylogenetic tree based on structural similarity (Figure 10). This tree differentiates the phage tail sheath proteins and other sheath proteins from type VI secretory system proteins and shows their slight similarity with the giant phage Mad1_20_16 and LacPavin_0818_WC45 TShPs. The tree places the archaeal sheath proteins and homologous sequences in several different clades, but groups the haloarchaeal myoviruses belonging to genera Haloferacalesvirus and Myohalovirus in two monophyletic branches according to the taxonomy. Two haloarchaeal prophage TShPs were found to be structurally similar to the Haloferacalesvirus and Myohalovirus phage proteins. It is noteworthy that this tree places most of the archaeal sequences in the branches adjacent to Jumbo phages infecting Gram-positive bacteria. Interestingly, this tree, based on structural similarity, indicates a closeness between the diffocin sheath protein from Peptoclostridium difficile and the tail sheath protein from Clostridium phage phiCDHM13 (genus Sherbrookevirus of the Myoviridae family). The proteins from the Herelleviridae and Chaseviridae families are placed in distinct clades, but two of the eight Ackermannviridae phages are in a separate branch and a different clade to the other six Ackermannviridae phages. Five of the six sheath proteins from anti-feeding prophages (AFP) are in a distinct clade adjacent to the clade containing Bacillus phage vB_BceM-HSE3 sheath proteins and homologous proteins found in archaeal and bacterial genomic sequences, but the remaining AFP sheath protein is in a different clade.
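The step from a matrix of pairwise TM-scores to a distance-based tree can be illustrated with a minimal sketch. It makes two assumptions that are not taken from the study: TM-scores are converted to distances as 1 - TM (averaging the two directional scores), and the classic neighbour-joining algorithm is used as a simpler stand-in for BioNJ, which additionally weights the distance updates by their variances. The function names and the toy TM-scores are hypothetical.

```python
from itertools import combinations


def tm_to_distance(tm):
    """Convert a square matrix of pairwise TM-scores into distances.

    Assumption: d = 1 - TM, with the two directional scores averaged
    so that the resulting matrix is symmetric.
    """
    n = len(tm)
    return [
        [0.0 if i == j else 1.0 - 0.5 * (tm[i][j] + tm[j][i]) for j in range(n)]
        for i in range(n)
    ]


def neighbor_joining(names, dist):
    """Classic neighbour joining; returns an unrooted Newick topology.

    Branch lengths are omitted. Names must be unique and must not
    contain commas or parentheses.
    """
    nodes = list(names)
    d = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            d[(a, b)] = dist[i][j]
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        r = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the NJ criterion Q.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]],
        )
        merged = f"({a},{b})"
        for c in nodes:
            if c not in (a, b):
                new_d = 0.5 * (d[(a, c)] + d[(b, c)] - d[(a, b)])
                d[(merged, c)] = d[(c, merged)] = new_d
        nodes = [c for c in nodes if c not in (a, b)] + [merged]
    return "(" + ",".join(nodes) + ");"


# Toy example with invented TM-scores: A/B and C/D are structurally close.
names = ["A", "B", "C", "D"]
tm = [
    [1.0, 0.8, 0.2, 0.2],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
tree = neighbor_joining(names, tm_to_distance(tm))
print(tree)
```

In the toy example the resulting topology separates the A/B pair from the C/D pair; which of the two pairs is written as the nested group depends on tie-breaking, since both describe the same unrooted tree.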
A phylogenetic analysis was performed based on the alignment of amino acid sequences, which included only the conserved domain, and used statistical methods such as bootstrap analysis for estimating the robustness of the tree. A total of 90 trimmed amino acid sequences for TShPs were used for the tree shown in Figure 11, and the full amino acid sequences of these proteins are depicted in Supplementary Figure S1. This tree demonstrated greater consistency with the taxonomy. Interestingly, this tree also often placed Jumbo phages infecting Gram-positive bacteria and archaeal phages closer to the root of the branches that included phages infecting Gram-negative bacteria. This tree also put the representatives of Haloferacalesvirus and Myohalovirus genera into distinct clades in the same way as the tree based on overall structural similarity. Although this tree indicates the relatedness of structural architecture and taxonomy, this relatedness is not absolute. For example, the number of domains of all modelled Peduovirinae sheath proteins was constant and equal to one, but the number of domains of Ackermannviridae TShPs varied.
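The bootstrap procedure mentioned above rests on resampling alignment columns with replacement and rebuilding the tree from each pseudo-replicate. The sketch below shows only the resampling step, under the assumption of a simple list-of-strings alignment; the function name and the toy sequences are hypothetical and are not taken from the study.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of a sequence alignment.

    Columns are drawn with replacement, and the same columns are used
    for every sequence so that the rows stay aligned.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy alignment of three short, invented sequences.
toy_alignment = ["MKT-A", "MRTGA", "MKSGA"]
rng = random.Random(0)  # fixed seed so the example is reproducible
replicate = bootstrap_alignment(toy_alignment, rng)
print(replicate)
```

The support for a branch is then the fraction of replicate trees in which the corresponding clade is recovered.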
It is also interesting that the topology of a phylogenetic tree constructed using the alignment of primary amino acid sequences for major capsid proteins (Figure 12) shows a similar, but not identical, composition of the clades, and places archaeal Haloferacalesvirus and Myohalovirus viruses in distinct clades close to the phages infecting Gram-positive bacteria. The differences in topology might reflect both the problems with consistent phylogenetic analysis of highly divergent proteins and the consequences of the modular evolution of phages [68]. Phylogenies based on the large subunit of terminase (TerL) (Supplementary Figure S2) and tail tube protein (TTP) (Supplementary Figure S3) showed a similar situation with a partial closeness in the composition of clades and non-identical topology. The phylogenetic analysis of the tail tube protein had less bootstrap support than those for the TShP, MCP, and TerL phylogenies. This may be due to shorter sequences for the TTPs compared to the other listed proteins and the possibility of a comparatively high mutational rate.

Discussion
The analysis of experimentally determined and modelled structures indicates the presence of a common conserved core inherent for all analysed sheath proteins, including phage TShPs, type VI secretion system sheath proteins, bacteriocin, and anti-feeding prophage sheath proteins. It is also noteworthy that bacterial flagellin contains a multidomain structure with a conserved core composed of N-and C-termini of a polypeptide chain important for the flagellin self-assembly, which somewhat resembles TShPs [69,70].
Most of the phage TShPs studied in this report differed from non-phage sheath proteins due to the presence of additional domains. These domains seem to be located away from the part of the TShP that is in contact with the tail tube. Such a location would prevent perturbation of the tail assembly and the function of the contractile mechanism. In this way, the evolution of phage TShPs was conditioned by its biological role. Phylogenetic analysis on the basis of structural similarity indicated the relatedness of T6SS sheath proteins and the TShP of giant phages. This may indicate the ancient divergence of phage sheath proteins and T6SS. Further, the similarity between anti-feeding prophage sheath proteins, the TShPs of Gram-positive bacteria, and the predicted sheath proteins for archaea may also have an ancient origin and were a consequence of the specialisation of AFP. In contrast, diffocin and pyocin sheath proteins appear to arise later and can be polyphyletic. As a minimum, the Peptoclostridium difficile diffocin sheath protein is structurally closer to Clostridium phages than it is to all other sheath proteins, and the Pseudomonas aeruginosa pyocin sheath protein is structurally similar to the TShPs of phages infecting beta-and gammaproteobacteria. For the future, the origin of phage-like contractile machines will require dedicated research using a larger representative group.
The size of the phage tail sheath proteins and the number of additional domains correlated with the size of the genome, such that small phages possessed shorter one- or two-domain TShPs. This observation seems reasonable, since the additional domains are not essential for the assembly and operation of the contractile mechanism, yet they consume resources for carrying the extra genetic material and for protein synthesis during the infection. Large phages often have multi-domain TShPs. The suggestion here is that the ancestral form of phage TShPs possessed one main domain, and that during the evolution of phages, accompanied by an increase in genome size, phage TShPs acquired additional domains. This process may have some features in common with the acquisition of additional functional genes during jumbo phage evolution [71].
The expenditure of additional resources on larger sheath proteins must be justified by competitive advantages provided by the additional domains. Most additional domains in the studied TShPs exhibited an immunoglobulin-like fold. It might be hypothesised that the presence of additional Ig-like TShP domains assists the adhesion of phages to the bacteria. Ig-like domains have been shown to be frequently exchanged horizontally between diverse classes of both lytic and temperate phages, and Ig-like domains "may play an accessory role in phage infection by weakly interacting with carbohydrates on the bacterial cell surface" [72]. A further hypothesis might be that these domains can participate in the formation of the tail appendages detected for some large phages [73], since Ig-like domains often participate in protein-protein interactions [74], which, in turn, also promote cell adhesion. It might also be possible that the presence of additional domains increases the stability of the virion, which is vital for phages [75,76], by cementing the assembled tail through interactions between the additional domains of TShPs. This proposal agrees with the suggestion of Nováček et al., based on the analysis of the cryo-EM reconstruction of Staphylococcus phage 812 [48], that an additional domain (named "domain II" in [48]) makes contact with domain III (which is part of the conserved core, according to the present research) "from a neighboring tail sheath protein probably stabilizing the tail sheath protein disk". Interestingly, the additional domains of the phage T4 sheath protein were proposed to be nonessential for tail sheath formation [1,77,78].
AlphaFold 2 software has shown an impressive level of accuracy in the modelling of proteins with experimentally determined tertiary structures. For 14 of the 15 structures, the RMSD was 0.59-1.33 Å. In one case (Staphylococcus phage 812), the RMSD was 3.27 Å for the contracted protein, and 2.83 Å for the native conformation. Low RMSD values could be a consequence of using templates, but comparative phylogenetic analysis using the structures of tail sheath proteins, sequences of major capsid proteins, and large subunits of terminase showed an identical or similar composition of clades, at least at the level of genera and subfamilies, supporting the results obtained from the AlphaFold 2 simulations.
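For readers unfamiliar with the metric, the RMSD values quoted above follow the usual definition, the square root of the mean squared distance between paired atoms. The snippet below is a minimal sketch of that calculation, not the procedure used in this study. It assumes that the two structures have already been superposed and that the atoms (for example, Cα atoms) are paired one-to-one. The function name `rmsd` and the toy coordinates are hypothetical and are not taken from any structure analysed here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumption: both coordinate sets are already superposed and listed in the
    same atom order; no fitting or alignment is performed here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")

    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates (hypothetical): every atom is displaced by 1 Å along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
experiment = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(model, experiment))
```

In practice, the superposition step (for example, a least-squares fit) precedes this calculation and is normally delegated to a structural biology package.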
It was previously shown that during contraction, the TShP subunits of phage T4 slide over each other with no apparent change in their structure [1], whereas the TShP of Staphylococcus phage 812 changes conformation during contraction [48]. For several analysed AlphaFold 2 models, the difference between the RMSD for experimentally determined structures in the contracted state and the models of the structures in an extended state was several-fold lower than the accuracy of the experimental methods. Therefore, it is hardly possible to conclude whether the contracted or the extended state is closer to the AlphaFold 2 model. Phylogenetic analysis using the major capsid protein and terminase has traditionally been used to reveal taxonomic and evolutionary relationships between bacteriophages [63,79]. It now seems that the structural and phylogenetic analysis of TShPs could help to clarify the evolutionary history of phages. Moreover, the results of AlphaFold 2 predictions could be used together with other analytical methods to elucidate the evolutionary history of proteins and bacteriophages. Conversely, phylogenetic analyses that do not take structural features into account can yield an erroneous evolutionary history, and incorrect alignments can lead to a flawed phylogeny even when the statistical analysis (e.g., bootstrap values) indicates high branch support, according to the principle of "garbage in, garbage out". The incongruence of the topologies of different trees can be caused not only by the inaccuracy of structural predictions, but also by the independent evolution of different proteins as a consequence of the modular evolution of bacteriophages [68]. Differences in the topology of phylogenetic trees can be observed even for conserved proteins such as MCP and terminase, as was noted for some phages in previous research [80].
It is noteworthy that our phylogenetic analyses for tail sheath proteins and major capsid proteins resulted in the distant placement of archaeal viruses belonging to Haloferacalesvirus and Myohalovirus genera, yet they were closely related according to our terminase large subunit phylogeny.
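The partial agreement between trees described above (similar clade composition, non-identical topology) can be expressed as a number. The snippet below is an illustrative sketch only and was not part of the analyses reported here. It assumes rooted trees written as nested tuples of leaf names, and it scores two trees by the Jaccard index of their clade sets, a simple measure related to, but not the same as, the Robinson-Foulds distance. The function names, the leaf labels "A" to "E", and both toy trees are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of leaf names. Single leaves and the root clade
    (all leaves) are excluded because every tree on the same taxa shares them.
    """
    found = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset()
        for child in node:
            members |= collect(child)
        found.add(members)
        return members

    all_leaves = collect(tree)
    found.discard(all_leaves)
    return found


def shared_clade_fraction(tree_a, tree_b):
    """Jaccard index of the clade sets: 1.0 means identical clade composition."""
    a, b = clades(tree_a), clades(tree_b)
    union = a | b
    if not union:
        return 1.0  # neither tree has any non-trivial clade
    return len(a & b) / len(union)


# Hypothetical toy trees on the same five taxa.
tree_tshp = ((("A", "B"), "C"), ("D", "E"))
tree_mcp = ((("A", "B"), ("C", "D")), "E")
print(shared_clade_fraction(tree_tshp, tree_mcp))
```

Here the two toy trees share only the clade {A, B} out of five distinct non-trivial clades. Because the measure depends on rooting, trees rooted differently would need to be re-rooted consistently before comparison.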
The origin of archaeal phages is an exceptionally important question for evolutionary biology. During our search for homologs of sheath proteins in archaeal genomes, several putative sequences were found in the metagenomic assemblies of archaea, which had been classified as representatives of groups other than Haloarchaea. These findings could be the result of erroneous metagenomic binning, but a BLAST search also found homologs of myoviral major capsid proteins or terminase in dozens of draft genomes attributed to Aenigmarchaeota, Asgard group, Bathyarchaeota, Korarchaeota, Nanoarchaeota, Pacearchaeota, Thaumarchaeota, and Woesearchaeota, and homologs of myoviral terminase in the complete genomes of Candidatus Caldarchaeum subterraneum spp., Candidatus Fermentimicrarchaeum limneticum isolate Sv326, Candidatus Heimdallarchaeota archaeon spp., etc. It is known that the archaeal myoviruses isolated to date preferentially infect archaea from the Euryarchaeota phylum [81]. The probable presence of myoviral sequences in the genomes of other archaea could indicate a wider diversity of archaeal viruses than is currently expected. The presence of several domains in putative archaeal sheath proteins found in some presumably archaeal sequences (i.e., those attributed to Crenarchaeota archaeon isolate LB_CRA_1 and Candidatus Pacearchaeota archaeon isolate ARS50) might suggest their prophage origin. The possible existence of myoviral prophages in non-euryarchaeal genomes needs very thorough analysis and verification. In addition, the results of our phylogenetic analysis of the structural similarity of sheath proteins suggest a polyphyletic origin for the predicted archaeal sheath proteins. If this assumption is correct, it is also possible that different groups of viruses with myoviral morphology existed before the divergence of the main archaeal and bacterial groups. Thus, the origin and early evolution of myoviruses require dedicated evolutionary studies.

Conclusions
The results of our bioinformatic research on phage tail sheath proteins indicate the presence of a conserved core in all sheath proteins that is presumably responsible for tail assembly and the function of the myoviral contractile injection mechanism. The evolution of the phage tail sheath protein has been accompanied by the incorporation of additional domains, many of which contain an immunoglobulin-like β-sandwich fold. The functional requirements of the phage contractile injection system have resulted in the appearance of a specific structural architecture for the phage tail sheath proteins that includes a conserved domain, composed of both N-terminal and C-terminal parts in contact with the phage tail tube, and additional domains, which could facilitate adhesion to the host cell.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/v14061148/s1. Figure S1. Best-scoring maximum likelihood (ML) phylogenetic tree constructed with 90 amino acid sequences of phage TShPs aligned with the mTM-align structural alignment algorithm. The NCBI taxonomy is shown to the right of the phage name. The numbers near the tree branches indicate the fraction of the bootstrap trees supporting the branch. The total number of bootstrap trees was 1000. The scale bar shows 0.5 estimated substitutions per site and the tree was rooted to the midpoint. Figure S2. Best-scoring ML phylogenetic tree constructed with 90 amino acid sequences of phage terminase large subunit aligned with MAFFT. The NCBI taxonomy is shown to the right of the phage name. The total number of domains in the modelled structures of the corresponding tail sheath proteins is shown in the column to the right of the taxonomic assignment. The numbers near the tree branches indicate the fraction of the bootstrap trees supporting the branch. The total number of bootstrap trees was 1000. The scale bar shows 0.5 estimated substitutions per site and the tree was rooted to the midpoint. Figure S3. Best-scoring ML phylogenetic tree constructed with 90 amino acid sequences of phage tail tube protein aligned with MAFFT. The NCBI taxonomy is shown to the right of the phage name. The total number of domains in the modelled structures of the corresponding tail sheath proteins is shown in the column to the right of the taxonomic assignment. The numbers near the tree branches indicate the fraction of the bootstrap trees supporting the branch. The total number of bootstrap trees was 1000. The scale bar shows 0.5 estimated substitutions per site and the tree was rooted to the midpoint. File S1. FASTA sequences of the modelled proteins. File S2. Best-ranked PDB structures modelled with AlphaFold 2.